A Computational Framework for Host-Pathogen Protein-Protein Interactions
Infectious diseases cause millions of illnesses and deaths every year and raise serious health concerns worldwide. Monitoring and curing infectious diseases remains a prevalent and intractable problem. Because host-pathogen interactions are considered the key infection processes at the molecular level, a large body of research has focused on them to understand infection mechanisms and to develop novel therapeutic solutions. Over the years, the continuous development of biological technologies has benefited wet-lab experiments, from small-scale biochemical, biophysical, and genetic experiments to large-scale methods such as yeast two-hybrid analysis and cryogenic electron microscopy. As a result of decades of effort, biological data have accumulated explosively, including multi-omics data such as genomics and proteomics data.
Chapter 2 therefore opens with a review of omics data, demonstrating recent developments in ‘omics’ studies with a particular focus on proteomics and genomics. High-throughput technologies have further accelerated the growth of ‘omics’ data, and the surge of interest in data analytics for bioinformatics comes as no surprise to researchers from a variety of disciplines. In particular, the astonishing rate at which genomics and proteomics data are generated has led researchers into the realm of ‘Big Data’. Chapter 2 thus provides an update on the omics background and the state-of-the-art developments in the omics area, with a focus on genomics data, from the perspective of big data analytics.
APEX2S: A Two-Layer Machine Learning Model for Discovery of Host-Pathogen Protein-Protein Interactions on Cloud-Based Multiomics Data
Faced with an avalanche of biological interaction data, computational biology now confronts greater challenges in big data analysis and calls for more studies that mine and integrate cloud-based multiomics data, especially data related to infectious diseases. Meanwhile, machine learning techniques have recently succeeded in a range of computational biology tasks. In this article, we focus on the study of host-pathogen protein-protein interactions, aiming to apply machine learning techniques to learn from the interaction data and make predictions. A comprehensive and practical workflow for harnessing different cloud-based multiomics data is discussed. In particular, a novel two-layer machine learning model, APEX2S, is proposed for the discovery of protein-protein interactions. The results show that our model can better learn from, and predict, the accumulated host-pathogen protein-protein interactions.
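The two-layer idea can be illustrated with a small stacked-classifier sketch. Everything here is invented for illustration: the feature names, base learners, weights, and threshold are assumptions, not details from the APEX2S paper.

```python
# Hypothetical sketch of a two-layer (stacked) predictor for protein pairs.
# Layer 1: independent base learners score a candidate host-pathogen pair;
# Layer 2: a meta-learner combines their outputs into a final prediction.

def base_sequence_score(pair):
    # Layer 1, learner A: decision from a (toy) sequence-similarity feature.
    return 1.0 if pair["seq_sim"] > 0.5 else 0.0

def base_expression_score(pair):
    # Layer 1, learner B: decision from a (toy) co-expression feature.
    return 1.0 if pair["coexpr"] > 0.3 else 0.0

def meta_predict(pair, weights=(0.6, 0.4), threshold=0.5):
    # Layer 2: weighted combination of base-learner outputs; in a real
    # stacked model the weights would themselves be learned.
    s = weights[0] * base_sequence_score(pair) + weights[1] * base_expression_score(pair)
    return int(s >= threshold)

print(meta_predict({"seq_sim": 0.8, "coexpr": 0.1}))  # 1: predicted interaction
print(meta_predict({"seq_sim": 0.2, "coexpr": 0.1}))  # 0: no interaction
```

The point of the second layer is that it can learn how much to trust each base learner, rather than fixing a single decision rule up front.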
Honest Score Client Selection Scheme: Preventing Federated Learning Label Flipping Attacks in Non-IID Scenarios
Federated Learning (FL) is a promising technology that enables multiple actors to build a joint model without sharing their raw data. Its distributed nature makes FL vulnerable to various poisoning attacks, including model poisoning attacks and data poisoning attacks. Many Byzantine-resilient FL methods have been introduced to mitigate model poisoning attacks, while their effectiveness against data poisoning attacks remains unclear. In this paper, we focus on the most representative data poisoning attack, the "label flipping attack", and measure its effectiveness against existing FL methods. The results show that existing FL methods perform similarly in independent and identically distributed (IID) settings but fail to maintain model robustness in non-IID settings. To mitigate these weaknesses, we introduce the Honest Score Client Selection (HSCS) scheme and the corresponding HSCSFL framework. In HSCSFL, the server collects a clean dataset for evaluation. In each iteration, the server collects the gradients from clients and then performs HSCS to select aggregation candidates. The server first evaluates the per-class performance of the global model and generates a corresponding risk vector indicating which classes could potentially be under attack. Similarly, the server evaluates each client's model and records its per-class performance as an accuracy vector. The dot product of each client's accuracy vector and the global risk vector gives the client's honest score; only the top p% of clients by honest score are included in the following aggregation. Finally, the server aggregates the gradients and uses the outcome to update the global model. Comprehensive experimental results show that HSCSFL effectively enhances FL robustness and defends against the label flipping attack.
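The selection step described above can be sketched as follows, assuming the per-class accuracies have already been measured on the server's clean dataset. The risk-vector definition (one minus per-class global accuracy) and all numbers are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of Honest Score Client Selection (HSCS): score each client
# by how well it still performs on the classes the global model struggles
# with, then keep only the top-p fraction for aggregation.

def honest_score(client_acc, risk):
    # Dot product of the client's per-class accuracy and the risk vector.
    return sum(a * r for a, r in zip(client_acc, risk))

def select_clients(client_accs, global_acc, p=0.5):
    # Risk vector: classes the global model handles poorly are high-risk
    # (illustrative choice: risk = 1 - per-class global accuracy).
    risk = [1.0 - a for a in global_acc]
    scores = {cid: honest_score(acc, risk) for cid, acc in client_accs.items()}
    k = max(1, int(len(scores) * p))
    return sorted(scores, key=scores.get, reverse=True)[:k]

global_acc = [0.9, 0.4]            # class 1 looks attacked (low accuracy)
client_accs = {
    "honest":  [0.9, 0.8],         # still accurate on the risky class
    "flipper": [0.9, 0.1],         # degraded on the risky class -> low score
}
print(select_clients(client_accs, global_acc, p=0.5))  # ['honest']
```

A label-flipping client scores low precisely because its accuracy collapses on the classes the risk vector weights most heavily.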
Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models
The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, erroneous generations, such as false predictions, misinformation, and hallucinations, have raised severe concerns about the trustworthiness of LLMs, especially in safety-, security-, and reliability-sensitive scenarios, potentially hindering real-world adoption. While uncertainty estimation has shown its potential for interpreting the prediction risks of general machine learning (ML) models, little is known about whether, and to what extent, it can help explore an LLM's capabilities and counteract its undesired behavior. To bridge the gap, in this paper we initiate an exploratory study of LLM risk assessment through the lens of uncertainty. In particular, we experiment with twelve uncertainty estimation methods and four LLMs on four prominent natural language processing (NLP) tasks to investigate to what extent uncertainty estimation techniques can characterize the prediction risks of LLMs. Our findings validate the effectiveness of uncertainty estimation for revealing LLMs' uncertain or non-factual predictions. Beyond general NLP tasks, we also conduct extensive experiments with four LLMs for code generation on two datasets, and find that uncertainty estimation can potentially uncover buggy programs generated by LLMs. Insights from our study shed light on the future design and development of reliable LLMs, facilitating further research toward enhancing their trustworthiness.
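As one illustration of the kind of measure this family of methods builds on, here is a sketch of predictive entropy computed over a next-token probability distribution. The probability vectors are made up for the example; a real LLM would supply them, and the paper evaluates twelve different estimation methods, not just this one.

```python
# Predictive entropy as a simple token-level uncertainty signal:
# H(p) = -sum_i p_i * log(p_i); higher entropy = less confident prediction.
import math

def predictive_entropy(probs):
    # Skip zero-probability entries, where p * log(p) is defined as 0.
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform over four tokens

print(predictive_entropy(confident) < predictive_entropy(uncertain))  # True
```

A uniform distribution over k tokens attains the maximum entropy log(k), which is one reason entropy-style scores are a natural first probe for flagging risky generations.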
Taming Gradient Variance in Federated Learning with Networked Control Variates
Federated learning, a decentralized approach to machine learning, faces significant challenges such as extensive communication overheads, slow convergence, and unstable improvements. These challenges primarily stem from gradient variance due to heterogeneous client data distributions. To address this, we introduce a novel Networked Control Variates (FedNCV) framework for federated learning. We adopt REINFORCE Leave-One-Out (RLOO) as the fundamental control variate unit in the FedNCV framework, implemented at both the client and server levels. At the client level, the RLOO control variate is employed to optimize local gradient updates, mitigating the variance introduced by data samples. Once relayed to the server, the RLOO-based estimator further provides an unbiased, low-variance aggregated gradient, leading to robust global updates. This dual-side application is formalized as a linear combination of composite control variates. We provide a mathematical expression capturing this integration of double control variates within FedNCV and present three theoretical results with corresponding proofs. This unique dual structure equips FedNCV to address data heterogeneity and scalability issues, potentially paving the way for large-scale applications. Moreover, we tested FedNCV on six diverse datasets under a Dirichlet distribution with α = 0.1 and benchmarked its performance against six state-of-the-art methods, demonstrating its superiority.
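The RLOO baseline at the heart of the control-variate unit can be sketched in a few lines. This shows only the generic leave-one-out centering that RLOO is built on, not the paper's full dual-side FedNCV construction.

```python
# REINFORCE Leave-One-Out (RLOO) baseline: each sample's reward is centered
# by the mean of the *other* k-1 samples' rewards. Because the baseline for
# sample i never depends on sample i itself, the centering reduces variance
# without biasing the gradient estimator.

def rloo_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    # Leave-one-out baseline for sample i: (total - r_i) / (k - 1).
    return [r - (total - r) / (k - 1) for r in rewards]

adv = rloo_advantages([1.0, 2.0, 3.0])
print(adv)       # [-1.5, 0.0, 1.5]
print(sum(adv))  # 0.0: the centered advantages sum to zero
```

FedNCV, as described above, applies this unit twice: once per client over local data samples, and once at the server over client gradients.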
Real-Time Management of Groundwater Resources Based on Wireless Sensor Networks
Groundwater plays a vital role in arid inland river basins, where groundwater management is critical to the sustainable development of the regional economy and ecology. Traditional sustainable management approaches either analyze different scenarios subject to assumptions or construct simulation–optimization models to obtain an optimal strategy. However, the groundwater system is time-varying due to exogenous inputs, so groundwater management based on static data quickly becomes outdated. The Daman irrigation district, part of the Heihe River Basin (HRB), a typical arid river basin in Northwestern China, was selected as the study area in this paper. First, a simulation–optimization model was constructed to optimize the pumping rates of the study area according to groundwater level constraints. Three different groundwater level constraints were assigned to explore sustainable strategies for groundwater resources. The results indicated that the simulation–optimization model was capable of identifying optimal pumping yields while satisfying the given constraints. Second, the simulation–optimization model was integrated with wireless sensor network (WSN) technology to give the management real-time features, allowing observations, constraints, and decision variables to be updated in real time. Furthermore, a web-based platform was developed to facilitate the decision-making process. By combining the simulation–optimization model with WSN techniques, this study attempted real-time monitoring and management of a scarce groundwater resource, which could be used to support decision-making related to sustainable management.
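The simulation–optimization loop can be caricatured in a few lines: simulate the groundwater level for a candidate pumping rate, then pick the largest rate that respects the level constraint. The linear drawdown response and all numbers below are invented assumptions, not the study's calibrated model; the constraint argument stands in for the value a WSN observation would update in real time.

```python
# Toy simulation-optimization sketch for pumping-rate selection.

def simulate_level(pumping, initial_level=30.0, drawdown_per_unit=0.02):
    # Stand-in for the groundwater simulation model: level falls linearly
    # with the pumping rate (purely illustrative response).
    return initial_level - drawdown_per_unit * pumping

def optimize_pumping(level_constraint, candidates):
    # Stand-in for the optimization layer: grid search over candidate
    # pumping rates, keeping those whose simulated level stays feasible.
    feasible = [q for q in candidates if simulate_level(q) >= level_constraint]
    return max(feasible) if feasible else 0.0

rates = range(0, 1001, 50)
print(optimize_pumping(level_constraint=28.0, candidates=rates))  # 100
```

In the real-time setting described above, new sensor readings would tighten or relax `level_constraint` (and recalibrate the simulator) before each re-optimization.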